In this study, we present the results of analysing the mis-splicing noise across different samples before and after the shRNA knockdown of a specific target gene, responsible for the production of different RNA binding proteins (RBPs).
More precisely, from the 356 RBPs studied by Van Nostrand et al. in 2020 in his publication A Large-Scale Binding and Functional Map of Human RNA Binding Proteins, 115 of them were functionally categorized under the following categories: splicing regulation, spliceosome or exon junction complex. We also added a list of 118 genes classified as involved in nononsense-mediated decay processes. However, only 56 of those were also profiled by ENCODE fulfilling the following requirements:
In total, we required 8 samples per studied RBP/NMD, divided in 4 cases and 4 controls which, at the same time, are divided in 4 cell line HepG2 and 4 cell line K562.
Across this document, we will refer to each possible target gene as a project and each sample type (i.e. control or case) as clusters. That is, we will study 56 different projects, each one consisting of two clusters with 4 samples each. We will also refer as RBPs to both RBP and NMD genes.
To properly execute this .Rmd, we need to define some variables:
# Data location
RBP_path <- here::here("RBPs/")
metadata_path <- here::here("Metadata/metadata_complete.tsv")
# Loading the metadata
metadata <- readr::read_delim(metadata_path, show_col_types = F)
target_RBPs = metadata %>%
dplyr::filter(if_any(c(Splicing_regulation, Spliceosome, Exon_junction_complex, NMD), ~ . != 0)) %>%
dplyr::pull(target_gene) %>%
unique()
metadata_RBPs <- metadata %>%
dplyr::filter(target_gene %in% target_RBPs) %>%
tidyr::pivot_longer(c("Splicing_regulation", "Spliceosome", "Exon_junction_complex", "NMD"),
names_to = "Category") %>%
dplyr::filter(value == 1) %>%
dplyr::select(-value) %>%
dplyr::distinct(target_gene, sample_id, .keep_all = T)
required_clusters <- metadata_RBPs %>% dplyr::pull(experiment_type) %>% unique()
# Parameters of the script
overwrite_results = F
common_introns_path = here::here("variables/global_common_introns.rds")
common_novel_path = here::here("variables/global_common_novel.rds")
MSR_A_tests_path <- here::here("variables/MSR_A_tests.rds")
MSR_D_tests_path <- here::here("variables/MSR_D_tests.rds")Using the ENCODE REST API, we automated the metadata extraction of the different experiments. By setting a series of filters in their search portal, we extracted a total of 170 different target genes of the shRNA knockdown with the previous requirements. As mentioned, only 56 of those were also studied by Van Nostrand et al. in his publication.
The relevant metadata of the RBPs studied in this document are the following:
The study of the metadata suggest that all ENCODE experiments were performed identically, and that the samples were extracted from the same two cell lines: K562 and HepG2. As such, all results can be compared between against each other.
For more information about the metadata extraction process, please refer to the corresponding repository: ENCODE Metadata Extraction.
For each RBP/NMD target gene, the process is split in several steps:
Download and extraction of BAM files: from the ENCODE Experiment search portal, the files related to each RBP/NMD are automatically downloaded, both case and control samples.
Junction extraction: all junctions are extracted using regtools junction extract after sorting and indexing with samtools. A file is created for each BAM file in BED12 format.
Junction annotation: the junctions are read from the previously created files and merged into a single data.frame of read junctions. We also register the number of reads of each junction in every sample. The junctions located within the ENCODE blacklisted regions v2 are removed and the MaxEntScan is calculated. The junction_annot() function from the package dasper is used to annotate the junctions to a reference transcriptome v105. All junctions not classified as either novel_donor, novel_acceptor or annotated are ignored. We also remove all junctions smaller than 25bp (base pairs) and those annotated introns that are ambiguously assigned to more than one gene.
Junction pairing: by looking for overlaps between the novel junctions and the annotated junctions for each sample, we measure the distance in bp between the novel and reference splice site. The annotated introns that are never associated to a novel junction are considered a never misspliced junctions.
Filtering the distances: we remove the pairings in which the novel junctions is associated to more than one reference intron across different samples. For more information about this process, please see the methods section in Introverse paper.
Next, we need to decide on a clustering method to combine and compare different samples. In this research, each cluster corresponds to a different target gene and experiment type (i.e. case and controls). As such, four samples per cluster are employed:
Measuring the mis-splicing ratio: by summing together all novel junction read counts attached to an annotated intron across all samples in which the novel splice was observed, and then dividing by the total number of reads of the annotated intron and the novel junctions across the same set of samples, we obtain a measurement of the mis-splicing ratio for an given annotated intron at both the donor splice site and the acceptor splice site. For more information about the mis-splicing ratio, please see section MSR.
Generation of the DB: two tables are created per each cluster: db_introns and db_novel. Each one contains the relevant information related to reference introns (including the never misspliced introns) and novel junctions. This includes the MaxEntScan scores, the percentage of protein-coding transcripts and the classification in u2 and u12 introns.
In order to compare the different projects, we only considered the common annotated introns across all projects and clusters. To do so, we generated the following data.frames:
Common annotated intron table: looped through every db_introns table and extracted only the information from the common annotated introns to all projects and clusters. To identify common annotated introns, we employed their locus (i.e. seqname:start-end:strand), since it is a unique identifier. The goal was to have the same number of annotated introns per project.
Common novel junction table: looped through every db_novel table and extracted only the information from the novel junctions associated to common annotated introns. Thus, we first needed to calculate the common annotated intron table. Unlike before, we can have a different number of novel junctions per project or cluster.
For each annotated intron, two Mis-Splicing Ratios (\(MSR\)) are calculated to provide a measurement of the mis-splicing frequency at the donor site (\(MSR_D\)) and acceptor site (\(MSR_A\)). To calculate these measurements, we first sum all of the novel donor/acceptor junction read counts and then divide by the sum of all annotated intron and novel junction read counts across the specific samples. It follows this formula:
\[ \begin{equation} MSR_A = \frac{\sum_{i=1}^{N}j_i}{\sum_{i=1}^{N}j_i+\sum_{i=1}^Ns_i } \end{equation} \] where \(j\) is the number of novel acceptor junction reads for a particular annotated intron, \(s\) is the number of annotated intron reads and \(N\) is the number of samples being studied. We can generate an \(MSR\) table in which each row corresponds to an annotated intron and each column to a cluster. Example of the generated \(MSR_A\) tables:
| Ref. junc. ID | MSR_A ADAR case | MSR_A ADAR control | MSR_A AQR case | MSR_A AQR control |
|---|---|---|---|---|
| 8404 | 0.024 | 0.000 | 0.022 | 0.034 |
| 12732 | 0.048 | 0.152 | 0.210 | 0.137 |
| 16759 | 0.000 | 0.004 | 0.003 | 0.000 |
| 17015 | 0.070 | 0.035 | 0.039 | 0.060 |
| 17681 | 0.321 | 0.171 | 0.106 | 0.182 |
| 19089 | 0.013 | 0.038 | 0.077 | 0.000 |
| 19097 | 0.000 | 0.000 | 0.000 | 0.000 |
| 20250 | 0.000 | 0.000 | 0.000 | 0.004 |
| 32980 | 0.000 | 0.007 | 0.105 | 0.003 |
| 32996 | 0.000 | 0.000 | 0.000 | 0.000 |
Once we have calculated the \(MSR_A\) and \(MSR_D\) for every annotated intron and every project, we use the paired Wilcoxon signed rank test to study if there is a significant variation in the median \(MSR\) in cases vs. controls. To be more precise:
For every project, we will have two p-values (one for each splice site) to reject or not the null hypothesis in favour of the alternative hypothesis.
For every project, we define the MSP as the number of mis-spliced annotated introns divided by the total number of annotated introns (i.e. the percentage of annotated introns that are found to be mis-spliced). This metric is calculated for both the control and case samples.
\[ \begin{equation} MSP = \frac{\text{# Mis-spliced annotated introns}}{\text{# Annotated introns}} \end{equation} \]
To do so, we first count the number of annotated introns for each project (this number is the same across all projects) and the number of mis-spliced annotated introns in cases and control. We calculate the MSP for each cluster, and extract the difference as \(MSP_{case} - MSP_{control}\).
There are two important statistics that we will study related to novel junctions: the number of unique novel junctions and the number of reads associated to novel junctions.
For each cluster, we will measure the number of unique novel junctions associated to the common annotated introns. This measurement is calculated for the different splice sites.
We study the variation in the number of unique novel junctions for each project by dividing the number in the case samples by the number in the control sample.
\[ \begin{equation} \text{Variation in unique novel junctions [%]} = \left(\frac{\text{# Unique novel junction}_{case}}{\text{# Unique novel junctions}_{control}}-1\right)*100\% \end{equation} \]
To measure the percentage of novel reads, we sum all reads of the novel junctions in a given cluster and divide it by the total number of reads in the cluster. We obtain the percentage of reads that correspond to novel junctions in the cluster.
\[ \text{Percentage of novel reads [%]}=\frac{\text{# Novel junction reads}}{\text{# Total reads}} *100\% \]
As before, we will divide the case percentage by the control percentage to study the variation between case and control.
\[ \text{Variation in percentage of novel reads [%]}=\left(\frac{\text{Percentage of novel reads}_{case}}{\text{Percenrage of novel reads}_{control}}-1\right)*100\% \]
From the results of the paired Wilcoxon ranked test, we keep only those projects in which the null hypothesis can be rejected (i.e. p-value after Bonferroni correction \(<= 0.05\)). The following tables shows the results for:
From the previous tables, the relevant parameter is the effect size of the Wilcoxon paired rank test. It is calculated using the Z statistic and the number of annotated introns studied, but it can be directly calculated using the function rstatix::wilcox_effsize(). Using the interpretation found in the functions’ documentation, we categorize the effect size in: small, moderate and large. We can graphically represent the effect values for all projects where the null hypothesis of the Wilcoxon signed rank test is rejected.
The X-axis represents the effect size while the Y-axis shows the results for each target gene in which the median \(MSR\) difference was significant. Positive effect sizes represent that the median \(MSR\) in the case samples is higher than for the control samples. The colour is used for the splice site, being blue for the acceptor site and green for the donor site.
If we study now the variation in the percentage of mis-spliced annotated introns, we obtain the following table:
We can also represent the results in the following graph, where the X-axis shows the difference column from the previous table. The Y-axis represents each target gene, divided by functional category. The colour only represents whether the variation is positive or negative.
The results of the variation in the number of unique novel junctions are found in the following table:
Results are also shown in the following graph, where the X-axis shows the variation column from the previous table. The Y-axis represents each target gene, divided by functional category and splice site (donor and acceptor).
IMPORTANT: The novel acceptor value for target gene AQR cannot be represented in the same axis as the other target genes. The value goes to 429% increase.
The results of the variation in the percentage of novel reads are found in the following table:
Results are also shown in the following graph, where the X-axis shows the variation column from the previous table. The Y-axis represents each target gene, divided by functional category and splice site (donor and acceptor).
IMPORTANT: The novel acceptor value for target gene AQR cannot be represented in the same axis as the other target genes. The value goes to 497% increase.
## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.1 (2022-06-23)
## os Ubuntu 20.04.4 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2023-03-14
## pandoc 2.18 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## BiocGenerics * 0.42.0 2022-04-26 [1] Bioconductor
## bit 4.0.5 2022-11-15 [1] RSPM (R 4.2.0)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.2.0)
## bitops 1.0-7 2021-04-24 [1] CRAN (R 4.2.0)
## bookdown 0.32 2023-01-17 [1] RSPM (R 4.2.0)
## bslib 0.4.2 2022-12-16 [1] RSPM (R 4.2.0)
## cachem 1.0.7 2023-02-24 [1] RSPM (R 4.2.0)
## cli 3.6.0 2023-01-09 [1] RSPM (R 4.2.0)
## codetools 0.2-18 2020-11-04 [2] CRAN (R 4.2.1)
## colorspace 2.1-0 2023-01-23 [1] RSPM (R 4.2.0)
## crayon 1.5.2 2022-09-29 [1] RSPM (R 4.2.0)
## crosstalk 1.2.0 2021-11-04 [1] RSPM (R 4.2.0)
## digest 0.6.31 2022-12-11 [1] RSPM (R 4.2.0)
## doParallel * 1.0.17 2022-02-07 [1] RSPM (R 4.2.0)
## dplyr * 1.1.0 2023-01-29 [1] RSPM (R 4.2.0)
## DT 0.27 2023-01-17 [1] RSPM (R 4.2.0)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
## evaluate 0.20 2023-01-17 [1] RSPM (R 4.2.0)
## fansi 1.0.4 2023-01-22 [1] RSPM (R 4.2.0)
## farver 2.1.1 2022-07-06 [1] RSPM (R 4.2.0)
## fastmap 1.1.1 2023-02-24 [1] RSPM (R 4.2.0)
## forcats * 1.0.0 2023-01-29 [1] RSPM (R 4.2.0)
## foreach * 1.5.2 2022-02-02 [1] RSPM (R 4.2.0)
## generics 0.1.3 2022-07-05 [1] RSPM (R 4.2.0)
## GenomeInfoDb * 1.32.4 2022-09-06 [1] Bioconductor
## GenomeInfoDbData 1.2.8 2022-05-02 [1] Bioconductor
## GenomicRanges * 1.48.0 2022-04-26 [1] Bioconductor
## ggforce 0.4.1 2022-10-04 [1] RSPM (R 4.2.0)
## ggnewscale * 0.4.8 2022-10-06 [1] RSPM (R 4.2.0)
## ggplot2 * 3.4.1 2023-02-10 [1] RSPM (R 4.2.0)
## glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
## gridExtra 2.3 2017-09-09 [1] RSPM (R 4.2.0)
## gtable 0.3.1 2022-09-01 [1] RSPM (R 4.2.0)
## here * 1.0.1 2020-12-13 [1] RSPM (R 4.2.0)
## highr 0.10 2022-12-22 [1] RSPM (R 4.2.0)
## hms 1.1.2 2022-08-19 [1] RSPM (R 4.2.0)
## htmltools 0.5.4 2022-12-07 [1] RSPM (R 4.2.0)
## htmlwidgets 1.6.1 2023-01-07 [1] RSPM (R 4.2.0)
## httr 1.4.5 2023-02-24 [1] RSPM (R 4.2.0)
## IRanges * 2.30.1 2022-08-18 [1] Bioconductor
## iterators * 1.0.14 2022-02-05 [1] RSPM (R 4.2.0)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.2.0)
## jsonlite 1.8.4 2022-12-06 [1] RSPM (R 4.2.0)
## kableExtra 1.3.4.9000 2023-01-30 [1] Github (haozhu233/kableExtra@292f607)
## knitr 1.42 2023-01-25 [1] RSPM (R 4.2.0)
## labeling 0.4.2 2020-10-20 [1] RSPM (R 4.2.0)
## lifecycle 1.0.3 2022-10-07 [1] RSPM (R 4.2.0)
## lubridate * 1.9.2 2023-02-10 [1] RSPM (R 4.2.0)
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
## MASS 7.3-58.2 2023-01-23 [1] RSPM (R 4.2.0)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.2.0)
## patchwork * 1.1.2 2022-08-19 [1] RSPM (R 4.2.0)
## pillar 1.8.1 2022-08-19 [1] RSPM (R 4.2.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
## polyclip 1.10-4 2022-10-20 [1] RSPM (R 4.2.0)
## purrr * 1.0.1 2023-01-10 [1] RSPM (R 4.2.0)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
## Rcpp 1.0.10 2023-01-22 [1] RSPM (R 4.2.0)
## RCurl 1.98-1.10 2023-01-27 [1] RSPM (R 4.2.0)
## readr * 2.1.4 2023-02-10 [1] RSPM (R 4.2.0)
## rlang 1.0.6 2022-09-24 [1] RSPM (R 4.2.0)
## rmarkdown 2.20 2023-01-19 [1] RSPM (R 4.2.0)
## rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.2.0)
## rstudioapi 0.14 2022-08-22 [1] RSPM (R 4.2.0)
## rvest 1.0.3 2022-08-19 [1] RSPM (R 4.2.0)
## S4Vectors * 0.34.0 2022-04-26 [1] Bioconductor
## sass 0.4.5 2023-01-24 [1] RSPM (R 4.2.0)
## scales 1.2.1 2022-08-20 [1] RSPM (R 4.2.0)
## sciRmdTheme 0.3 2023-03-10 [1] local
## sessioninfo * 1.2.2 2021-12-06 [1] RSPM (R 4.2.0)
## stringi 1.7.12 2023-01-11 [1] RSPM (R 4.2.0)
## stringr * 1.5.0 2022-12-02 [1] RSPM (R 4.2.0)
## svglite 2.1.1 2023-01-10 [1] RSPM (R 4.2.0)
## systemfonts 1.0.4 2022-02-11 [1] RSPM (R 4.2.0)
## tibble * 3.1.8 2022-07-22 [1] RSPM (R 4.2.0)
## tidyr * 1.3.0 2023-01-24 [1] RSPM (R 4.2.0)
## tidyselect 1.2.0 2022-10-10 [1] RSPM (R 4.2.0)
## tidyverse * 2.0.0 2023-02-22 [1] RSPM (R 4.2.0)
## timechange 0.2.0 2023-01-11 [1] RSPM (R 4.2.0)
## tweenr 2.0.2 2022-09-06 [1] RSPM (R 4.2.0)
## tzdb 0.3.0 2022-03-28 [1] RSPM (R 4.2.0)
## utf8 1.2.3 2023-01-31 [1] RSPM (R 4.2.0)
## vctrs 0.5.2 2023-01-23 [1] RSPM (R 4.2.0)
## viridis 0.6.2 2021-10-13 [1] RSPM (R 4.2.0)
## viridisLite 0.4.1 2022-08-22 [1] RSPM (R 4.2.0)
## vroom 1.6.1 2023-01-22 [1] RSPM (R 4.2.0)
## webshot 0.5.4 2022-09-26 [1] RSPM (R 4.2.0)
## withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
## xfun 0.37 2023-01-31 [1] RSPM (R 4.2.0)
## xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.0)
## XVector 0.36.0 2022-04-26 [1] Bioconductor
## yaml 2.3.7 2023-01-23 [1] RSPM (R 4.2.0)
## zlibbioc 1.42.0 2022-04-26 [1] Bioconductor
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library
##
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────